18.02 LECTURE NOTES ON PROBABILITY 



Continuous Probability 

Discrete probability describes games of chance with a list of outcomes, such as whether
a coin lands on heads or tails, or a die lands on one of the values 1 through 6. In contrast,
continuous probability concerns quantities that can take on all possible values in a continuum.
In the discrete case, the probability of an outcome or an average value is expressed as a
sum, whereas in the continuous case these values are described using integrals.

The basic equation of probability theory is

$$\text{PROBABILITY} = \frac{\text{PART}}{\text{WHOLE}}$$

Note that the probability is a number between 0 and 1.

Example 1. We say that $x$ is uniformly distributed on the interval $0 < x < 10$ if any
value of $x$ is as likely as any other. In this case, the probability that $1 < x < 7$ is

$$\frac{\text{PART}}{\text{WHOLE}} = \frac{7-1}{10} = \frac{6}{10}$$

If $0 < a < b < 10$, then the probability that $a < x < b$ is given by the formula

$$P(a < x < b) = \int_a^b \frac{1}{10}\,dx = \frac{b-a}{10}$$

The probability density of $x$ is $1/10$ on $0 < x < 10$ and zero outside this interval. More
generally, $x$ can be distributed by a nonnegative function $g(x)$ so that

$$P(a < x < b) = \int_a^b g(x)\,dx$$

Because the total probability is 1, we need

$$\int_{-\infty}^{\infty} g(x)\,dx = 1$$

In our example, $g(x) = 1/10$ on $0 < x < 10$ and $g(x) = 0$ outside this interval. With
continuous variables like $x$ it does not matter whether we include the ends $x = a$ and $x = b$
or not. We interpret the events $x = a$ and $x = b$ as happening with zero probability. Thus,

$$P(a < x < b) = P(a \le x \le b).$$
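
As a quick numerical check of Example 1, the sketch below integrates the uniform density with a midpoint rule. The names `g` and `prob` and the number of subintervals are illustrative choices, not part of the notes.

```python
def g(x):
    """Uniform density on the interval 0 <= x <= 10 (Example 1)."""
    return 0.1 if 0.0 <= x <= 10.0 else 0.0

def prob(a, b, n=100_000):
    """Approximate P(a < x < b) by the midpoint rule with n subintervals."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

print(prob(1, 7))      # close to 6/10
print(prob(-5, 15))    # total probability, close to 1
```

Integrating over a range wider than the interval, as in the second call, shows that the density contributes nothing outside $0 < x < 10$.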

Example 2. Consider a point $(x, y)$ distributed according to the weighting or density
$\delta(x, y) = x^2 + y^2$ on the unit disk $D$, $x^2 + y^2 \le 1$. The probability that $(x, y)$ is in a portion
$R$ of $D$ is

$$P((x,y) \text{ in } R) = \frac{\text{PART}}{\text{WHOLE}} = \frac{\text{mass}(R)}{\text{mass}(D)} = \frac{1}{M}\iint_R \delta\,dA$$

where $M = \iint_D \delta\,dA$ is the total mass of $D$. We also write

$$P((x,y) \text{ in } R) = \iint_R \frac{\delta(x,y)}{M}\,dA$$

In other words, the probability density $g(x,y) = \delta(x,y)/M$ is normalized so that the total
integral is 1.

$$\iint_D g(x,y)\,dx\,dy = 1.$$

Using polar coordinates,

$$M = \iint_D \delta\,dA = \int_0^{2\pi}\!\int_0^1 r^2\,r\,dr\,d\theta = \frac{\pi}{2}$$

If $R$ is the ring $a < r < b$, then

$$P(a < r < b) = \int_0^{2\pi}\!\int_a^b \frac{2}{\pi}\,r^2\,r\,dr\,d\theta = \int_a^b \frac{2}{\pi}\,r^3\,dr \int_0^{2\pi} d\theta = \int_a^b 4r^3\,dr$$

Thus, by integrating in the $\theta$ variable, we obtain the probability density in the remaining
variable $r$. The probability density of $r$ in this example is $g(r) = 4r^3$ for $0 < r < 1$ and
$g(r) = 0$ outside this interval. As usual, the total probability is

$$P(0 < r < 1) = \int_0^1 4r^3\,dr = 1$$
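
The polar-coordinate computation in Example 2 can be checked numerically. The following is a minimal sketch, assuming a midpoint rule for the radial integral; the helper names `radial_integral` and `ring_probability` are hypothetical.

```python
import math

def radial_integral(a, b, n=100_000):
    """Midpoint-rule approximation of the integral of r^2 * r dr over a < r < b."""
    h = (b - a) / n
    return sum((a + (i + 0.5) * h) ** 3 for i in range(n)) * h

# Total mass M: the theta integral contributes a factor of 2*pi.
M = 2 * math.pi * radial_integral(0.0, 1.0)

def ring_probability(a, b):
    """P(a < r < b) = mass(ring) / M for the density x^2 + y^2 on the unit disk."""
    return 2 * math.pi * radial_integral(a, b) / M

print(M, math.pi / 2)               # the two values should agree closely
print(ring_probability(0.5, 1.0))   # close to 1 - 0.5**4 = 0.9375
```

The second value matches the integral of $4r^3$ over $0.5 < r < 1$, which is $b^4 - a^4$.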



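Example 3. The integral

$$M = \int_{-\infty}^{\infty} e^{-x^2}\,dx$$

is very important in probability theory. It gives us the normalizing factor to use when
defining

$$G(x) = \frac{1}{M}\,e^{-x^2}$$

as a probability density, that is, $M$ is the constant we need in order that

$$\int_{-\infty}^{\infty} G(x)\,dx = 1$$

The function $G(x)$ is the well-known bell curve or normal distribution.

To compute $M = \int_{-\infty}^{\infty} e^{-x^2}\,dx$, rewrite $M^2$ in a clever way, as in lecture:

$$\begin{aligned}
M^2 &= \left(\int_{-\infty}^{\infty} e^{-x^2}\,dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2}\,dy\right) = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} e^{-x^2}\,dx\right) e^{-y^2}\,dy \\
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{-x^2} e^{-y^2}\,dx\,dy = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{-x^2-y^2}\,dx\,dy \\
&= \int_0^{2\pi}\!\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta = \int_0^{2\pi} \frac{1}{2}\,d\theta = \pi
\end{aligned}$$

Therefore, $M^2 = \pi$ and $M = \sqrt{\pi}$. In all, we have

$$G(x) = \frac{1}{\sqrt{\pi}}\,e^{-x^2}; \qquad \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}}\,e^{-x^2}\,dx = 1$$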
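
The value $M = \sqrt{\pi}$ can be confirmed numerically. This is only a sketch: the cutoff at $|x| = 10$, the midpoint rule, and the name `gaussian_integral` are assumptions made for illustration.

```python
import math

def gaussian_integral(limit=10.0, n=200_000):
    """Midpoint-rule approximation of the integral of exp(-x^2) over [-limit, limit].

    The tails beyond |x| = 10 are smaller than exp(-100) and are ignored.
    """
    h = 2 * limit / n
    return sum(math.exp(-(-limit + (i + 0.5) * h) ** 2) for i in range(n)) * h

M = gaussian_integral()
print(M, math.sqrt(math.pi))   # the two values should agree closely
print(M * M, math.pi)          # M^2 compared with pi
```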

The importance of this function to probability was discovered by Abraham de Moivre around
1700. We have used the name $G(x)$ in honor of Karl Friedrich Gauss, who laid the foundations
of probability theory (along with the method of least squares) in the process of
finding better ways to make use of measurements in astronomy. Any function of the form
$e^{-ax^2+bx+c}$ is called a Gaussian.

In order for you to recognize the normal distribution in the future, you will need to 
recognize all its scalings, related to the parameters a, b and c above. The scaling of the 
Gaussian will be discussed in the optional section at the end of these Notes. That section 
is not necessary for 18.02, but it will give you a brief look at some tools and terminology in 
the theory of probability. 



Conditional probability 



To choose $0 < x < 1$ and $0 < y < 1$ independently "at random" means

$$P((x,y) \text{ in } R) = \text{area}(R); \qquad R \text{ in the unit square}$$

Thus $P(x > 1/2) = 1/2$. But the probability changes when we add information:

$$P(x > 1/2 \mid xy = 1/1000) = \;?$$

This notation means the probability that x > 1/2 given that we already know xy = 1/1000. 

It is known as a conditional probability. 

Computing conditional probabilities of this kind is very closely related to computing 
integrals using a change of variable. The conditional probability density turns out to be the 
Jacobian factor, renormalized so that the total probability is one. 

Recall that if $u = x$ and $v = xy$, then on Thursday we showed that

$$\int_0^1\!\int_0^1 dx\,dy = \int_0^1\!\int_v^1 \frac{1}{u}\,du\,dv$$

Note that the interesting parts of this calculation are that the Jacobian $J = 1/u$ and that
the range of $u$ with $v$ fixed is $v < u < 1$. Consider any fixed value $xy = v = v_0$, and consider
the inner integral

$$\int_{v_0}^1 \frac{1}{u}\,du$$

The idea is that if $xy = v_0$, then $v_0 < u < 1$ is the full range for $u = x$ and that the Jacobian
factor $1/u$ is the probability density on that interval. We need to normalize by this
total mass.

$$M = \int_{v_0}^1 \frac{1}{u}\,du = -\ln(v_0)$$

For $xy = v_0$ fixed, this means that $x > 1$ and $x < v_0$ never happen. In between, for
$v_0 \le a < b \le 1$,

$$P(a < x < b \mid xy = v_0) = \frac{1}{M}\int_a^b \frac{du}{u} = \frac{\text{part}}{\text{whole}}$$

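Therefore, with $a = 1/2$, $b = 1$, $v_0 = 1/2^{10} = 1/1024$ (close enough to $1/1000$)

$$P(x > 1/2 \mid xy = 1/2^{10}) = \frac{\int_{1/2}^{1} \frac{du}{u}}{\int_{1/2^{10}}^{1} \frac{du}{u}} = \frac{-\ln(1/2)}{-\ln(1/2^{10})} = \frac{\ln 2}{10 \ln 2} = \frac{1}{10}$$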
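
The part-over-whole formula is short enough to evaluate directly. The sketch below assumes $v_0 \le a < b \le 1$; the function name `conditional_prob` is hypothetical.

```python
import math

def conditional_prob(a, b, v0):
    """P(a < x < b | xy = v0) for v0 <= a < b <= 1.

    Part over whole: the integral of du/u over (a, b), which is ln(b/a),
    divided by the total mass M = -ln(v0).
    """
    return math.log(b / a) / (-math.log(v0))

print(conditional_prob(0.5, 1.0, 2 ** -10))   # close to 1/10
print(conditional_prob(0.5, 1.0, 1e-3))       # v0 = 1/1000, slightly above 1/10
```

The second call uses the value $1/1000$ from the original question rather than the convenient approximation $1/1024$.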



Why this formula works. There is a difficulty with fixing $xy = v_0$. The condition
confines us to a single curve, which has zero area and hence zero probability. So when we
divide the part by the whole, we are dividing zero by zero. To repair this difficulty, consider
a small interval $v_0 \le v \le v_0 + \Delta v$, carry out the computation of the ratio, and take the limit
as $\Delta v$ tends to zero. This is the correct way of thinking about what it means to know that
$xy = .001$ because any computation that reports the value .001 does it with a roundoff error
$\Delta v$. For practical purposes, the limit of the ratio as $\Delta v$ tends to zero is the same as the
ratio at any fixed band of values of $v$ corresponding to $\pm\Delta v = \pm 10^{-\ldots}$ or smaller. (Matlab
is accurate to 15 digits.)

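Consider the area of the whole region.

$$\text{area}(v_0 < xy < v_0 + \Delta v) = \int_{v_0}^{v_0+\Delta v}\!\int_v^1 \frac{du}{u}\,dv$$

On a very short interval $v_0 \le v \le v_0 + \Delta v$, the inner integral is nearly constant:

$$\int_v^1 \frac{du}{u} \approx \int_{v_0}^1 \frac{du}{u}$$

Therefore,

$$\int_{v_0}^{v_0+\Delta v}\!\int_v^1 \frac{du}{u}\,dv \approx \int_{v_0}^{v_0+\Delta v}\!\int_{v_0}^1 \frac{du}{u}\,dv = \Delta v \int_{v_0}^1 \frac{du}{u}$$

Similarly, for $v_0 + \Delta v < a < b \le 1$, the area of the part is

$$\text{area}(a < x < b \text{ and } v_0 < xy < v_0 + \Delta v) = \int_{v_0}^{v_0+\Delta v}\!\int_a^b \frac{du}{u}\,dv = \Delta v \int_a^b \frac{du}{u}$$

So the factor $\Delta v$ cancels in the ratio of the part to the whole.

$$P(a < x < b \mid v_0 < v < v_0 + \Delta v) \approx \frac{\Delta v \int_a^b \frac{du}{u}}{\Delta v \int_{v_0}^1 \frac{du}{u}} = \frac{\int_a^b \frac{du}{u}}{\int_{v_0}^1 \frac{du}{u}}$$

In the limit as $\Delta v \to 0$,

$$P(a < x < b \mid v = v_0) = \frac{\int_a^b \frac{du}{u}}{\int_{v_0}^1 \frac{du}{u}}$$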
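
The limiting argument can be illustrated by computing the band ratio for shrinking $\Delta v$. This is a sketch under the assumption $v_0 + \Delta v \le a < b \le 1$; it uses the fact that the inner integral of $du/u$ from $v$ to 1 equals $-\ln v$, whose antiderivative is $v - v \ln v$. The name `band_ratio` is hypothetical.

```python
import math

def band_ratio(a, b, v0, dv):
    """P(a < x < b | v0 < xy < v0 + dv) for (x, y) uniform on the unit square.

    Assumes v0 + dv <= a < b <= 1, as in the text.
    """
    # Whole: integral over v of (integral of du/u from v to 1) = integral of -ln(v),
    # whose antiderivative is v - v*ln(v).
    def antiderivative(v):
        return v - v * math.log(v)

    whole = antiderivative(v0 + dv) - antiderivative(v0)
    # Part: the inner integral ln(b/a) does not depend on v.
    part = dv * math.log(b / a)
    return part / whole

for dv in (1e-4, 1e-6, 1e-8):
    print(dv, band_ratio(0.5, 1.0, 2 ** -10, dv))   # approaches 0.1 as dv shrinks
```

Here the whole is computed exactly rather than by the nearly-constant approximation, so the printed values show how quickly the ratio settles to the limit.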



There are pitfalls in dealing with zero area sets like $xy = v_0$. For example, one may
be tempted to use the ratio of the arclength of the curve on the portion $a < x < b$ to the
total arclength of $xy = v_0$. This gives the wrong answer. The reason is that the band
$v_0 \le xy \le v_0 + \Delta v$ does not have uniform thickness. If it did, then the weighting or density
would be equivalent to arclength. Put another way, if one tries to partition the square
into subsets of uniform thickness around curves of the form $xy = v_0$ this cannot be done
without overlapping bands or bands that miss substantial sections. By contrast, the bands
$v_0 \le xy < v_0 + \Delta v$ are compatible with partitioning without gaps or overlaps: use pieces of
the form $k\Delta v \le xy < (k+1)\Delta v$, $k = 0, 1, \ldots$
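
To see how far off the arclength ratio is, the sketch below computes it for $v_0 = 1/2^{10}$, $a = 1/2$, $b = 1$ and compares it with the correct conditional probability. The substitution $x = e^s$ and the helper name `arclength` are choices made here for numerical convenience, not taken from the notes.

```python
import math

def arclength(x_lo, x_hi, v0, n=100_000):
    """Arclength of the curve y = v0 / x for x_lo < x < x_hi.

    Uses the substitution x = exp(s), under which the arclength element
    sqrt(1 + v0^2 / x^4) dx becomes sqrt(x^2 + v0^2 / x^2) ds, and then
    the midpoint rule in s.
    """
    s_lo, s_hi = math.log(x_lo), math.log(x_hi)
    h = (s_hi - s_lo) / n
    total = 0.0
    for i in range(n):
        x = math.exp(s_lo + (i + 0.5) * h)
        total += math.sqrt(x * x + (v0 / x) ** 2)
    return total * h

v0 = 2 ** -10
arc_ratio = arclength(0.5, 1.0, v0) / arclength(v0, 1.0, v0)
correct = math.log(2) / (-math.log(v0))
print(arc_ratio, correct)   # the arclength ratio is well above the correct value
```

The curve hugs both axes, so its total length is a little under 2, and the stretch over $1/2 < x < 1$ accounts for roughly a quarter of it rather than a tenth.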

Expected value, variance, and standard deviation:
Rescaling the normal distribution. 



This section is optional reading. It introduces a few standard notions and terminology
from probability theory. Recall that if a variable $x$ is distributed with a density $g(x)$, then

$$P(a < x < b) = \int_a^b g(x)\,dx, \qquad g(x) \ge 0$$

Since the total integral is 1, the average value or mean of $x$ is

$$\mu = \int_{-\infty}^{\infty} x\,g(x)\,dx$$

($\mu$ for mean). In probability, the upper case letter $X$ denotes something called a random
variable, which can be viewed as a quantity that will vary depending on each sampling of
the variable.^1 The mean or expected value of $X$, $E(X)$, is a theoretical value for what one
would expect to get if one averaged over several samples of the variable $X$. This quantity
is the same as the average value or mean value of $x$,

$$E(X) = \int_{-\infty}^{\infty} x\,g(x)\,dx = \mu$$

One can take the expected value of any function $f(X)$. The formula is

$$E(f(X)) = \int_{-\infty}^{\infty} f(x)\,g(x)\,dx$$

Again, this is a weighted average of $f$. The expected value of $X$ is just the special case
$f(X) = X$. It is also interesting to evaluate the expectation of functions like $f(X) = X^k$,
called the $k$th moments of $X$, and $f(X) = e^{tX}$.

The variance $V(X)$ is a measure of the likelihood that $X$ is far from its mean. It is the
average of the square of the distance from $X$ to its mean, $(x - \mu)^2$,

$$V(X) = E((X - \mu)^2) = \int_{-\infty}^{\infty} (x - \mu)^2\,g(x)\,dx$$

Because the variance involves the square of distance, it is natural to take a square root (as
in the Pythagorean theorem). The standard deviation is defined by

$$\sigma(X) = \sqrt{V(X)}$$
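
The definitions of mean, variance, and standard deviation can be tried out on the uniform density of Example 1. The sketch below assumes a density that vanishes outside the integration range; `expectation` and `uniform_density` are hypothetical names.

```python
import math

def expectation(f, g, lo, hi, n=100_000):
    """Midpoint-rule approximation of E[f(X)], the integral of f(x) g(x) dx over [lo, hi].

    Assumes the density g vanishes outside [lo, hi].
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += f(x) * g(x)
    return total * h

def uniform_density(x):
    """Uniform density on [0, 10], as in Example 1."""
    return 0.1 if 0.0 <= x <= 10.0 else 0.0

mu = expectation(lambda x: x, uniform_density, 0.0, 10.0)
variance = expectation(lambda x: (x - mu) ** 2, uniform_density, 0.0, 10.0)
sigma = math.sqrt(variance)
print(mu, variance, sigma)   # mean close to 5, variance close to 100/12
```

Passing a different `f`, such as `lambda x: x ** k`, gives the moments mentioned above.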

We can now explain the scaling of the normal distribution. There is a probability density
for each $\sigma > 0$,

$$g_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/2\sigma^2}$$

It has three properties:

$$\int_{-\infty}^{\infty} g_\sigma(x)\,dx = 1 \quad \text{(total integral 1)}, \qquad (1)$$

$$\int_{-\infty}^{\infty} x\,g_\sigma(x)\,dx = 0 \quad \text{(mean 0)}, \qquad (2)$$

$$\int_{-\infty}^{\infty} x^2 g_\sigma(x)\,dx = \sigma^2 \quad \text{(variance } \sigma^2\text{)}. \qquad (3)$$

^1 For instance, in Matlab the command rand(3,4) produces a 3 x 4 matrix with each element
uniformly distributed between 0 and 1. The entries are also independent of each other. Independence is a
key concept in probability. But it would take us too far afield to discuss it in any detail.

To shift the mean to $\mu$, take

$$g_\sigma(x - \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}$$



The standard deviation $\sigma$ or the variance $\sigma^2$ measures how far the distribution is from its
mean value. For larger $\sigma$, $g_\sigma(x)$ is flatter and has a smaller maximum value $1/(\sqrt{2\pi}\,\sigma)$. In
other words, it is more weighted towards larger values, and $X^2$ is more likely to take on
larger values. The formula for the variance above is an exact, quantitative expression of
this qualitative comparison between the shapes of the graphs of the densities for different
values of $\sigma$.

The function $G(x) = \frac{1}{\sqrt{\pi}}\,e^{-x^2}$ considered in Example 3 above equals $g_\sigma$ for $\sigma = 1/\sqrt{2}$.
Thus $G$ is the normal distribution with mean 0 and standard deviation $1/\sqrt{2}$ (variance 1/2).

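To confirm (1), (2), and (3), recall that we already showed that

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$

Change variables by $x = az$, $dx = a\,dz$, to get

$$\int_{-\infty}^{\infty} e^{-a^2 z^2}\,a\,dz = \sqrt{\pi}$$

Putting $a^2 = \dfrac{1}{2\sigma^2}$, so that $a = \dfrac{1}{\sqrt{2}\,\sigma}$, gives

$$\int_{-\infty}^{\infty} e^{-z^2/2\sigma^2}\,\frac{1}{\sqrt{2}\,\sigma}\,dz = \sqrt{\pi} \qquad (4)$$

and dividing by $\sqrt{\pi}$ gives (1).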

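Next, multiply (4) by $\sqrt{2}\,\sigma$ to obtain

$$\int_{-\infty}^{\infty} e^{-z^2/2\sigma^2}\,dz = \sqrt{2\pi}\,\sigma \qquad (5)$$

Differentiating the left side of (5) with respect to $\sigma$ gives

$$\frac{d}{d\sigma}\int_{-\infty}^{\infty} e^{-z^2/2\sigma^2}\,dz = \int_{-\infty}^{\infty} \frac{\partial}{\partial\sigma}\,e^{-z^2/2\sigma^2}\,dz = \int_{-\infty}^{\infty} \frac{z^2}{\sigma^3}\,e^{-z^2/2\sigma^2}\,dz$$

On the other hand, the derivative of the right-hand side of (5) with respect to $\sigma$ is $\sqrt{2\pi}$. Hence,

$$\int_{-\infty}^{\infty} \frac{z^2}{\sigma^3}\,e^{-z^2/2\sigma^2}\,dz = \sqrt{2\pi}$$

Dividing by $\sqrt{2\pi}$ and multiplying by $\sigma^2$ yields (3).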

Finally, (2) is obvious because the integrand is odd. To confirm that the mean is $\mu$ for
the density $g_\sigma(x - \mu)$, change variables to $z = x - \mu$. Then, using (2) and (1),

$$\int_{-\infty}^{\infty} x\,g_\sigma(x - \mu)\,dx = \int_{-\infty}^{\infty} (z + \mu)\,g_\sigma(z)\,dz = \int_{-\infty}^{\infty} z\,g_\sigma(z)\,dz + \mu\int_{-\infty}^{\infty} g_\sigma(z)\,dz = 0 + \mu = \mu$$
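
Properties (1), (2), (3) and the shifted mean can also be checked numerically. The sketch below is only an illustration: the values of `sigma` and `mu`, the integration range, and the helper names `g_sigma` and `integrate` are arbitrary assumptions.

```python
import math

def g_sigma(x, sigma):
    """Normal density with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def integrate(f, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

sigma, mu = 1.5, 2.0      # arbitrary illustrative values
lo, hi = -40.0, 40.0      # wide enough that the neglected tails are negligible

total = integrate(lambda x: g_sigma(x, sigma), lo, hi)              # property (1)
mean0 = integrate(lambda x: x * g_sigma(x, sigma), lo, hi)          # property (2)
var = integrate(lambda x: x * x * g_sigma(x, sigma), lo, hi)        # property (3)
shifted = integrate(lambda x: x * g_sigma(x - mu, sigma), lo, hi)   # mean of g_sigma(x - mu)
print(total, mean0, var, shifted)   # expected near 1, 0, sigma**2 = 2.25, and mu = 2.0
```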